A file format is the way that information is encoded for storage in a computer file. It may describe the encoding at various levels of abstraction including low-level bit and byte layout as well high-level organization such as Markup language and tabular structure. A file format may be standardized (which can be proprietary or open format) or it can be an ad hoc convention.
Some file formats are designed for very particular types of data: PNG files, for example, store Raster graphics images using lossless data compression. Other file formats, however, are designed for storage of several different types of data: the Ogg format can act as a container for different types of multimedia including any combination of sound and video, with or without text (such as ), and metadata. A text file can contain any stream of characters, including possible control characters, and is encoded in one of various character encoding schemes. Some file formats, such as HTML, scalable vector graphics, and the source code of computer software are text files with defined syntaxes that allow them to be used for specific purposes.
If there is no specification available, a developer might reverse engineer the format by inspecting files in that format or acquire the specification for a fee and by signing a non-disclosure agreement. Due to the time and money cost of these approaches, file formats with publicly available specifications tend to be supported by more programs.
In the now-antiquated FAT file system, file names were limited to eight characters for the base name plus a three-character extension, known as an 8.3 filename. Due to the prevalence of this naming scheme, many formats still use three-character extensions even though modern systems support longer extensions. Since there is no standardized list of extensions, more than one format can use the same extension especially for three-letter extensions since there is a limited number of three-letter combinations. This situation can confuse both users and applications.
One implication of indicating the file type with the extension is that the users and applications can be tricked into treating a file as a different format simply by renaming it. For example, an HTML file can be treated as plain text by adding (or changing the existing) extension . Although this strategy is useful, it can be confusing to less technical users who accidentally make a file unusable (or "lose" it). To try to avoid this scenario, Windows and macOS support hiding the extension.
Hiding the extension, however, can create the appearance of multiple files with the same name in the same folder, which is confusing for people. For example, an image may be needed both in format (for publishing) and PNG format (for web sites) and one might name them with the same base name (for example, and ). With extensions hidden they appear to have the same name: .
Hiding extensions can also pose a security risk. For example, a malicious user could create an executable program with an innocent name such as "". The "" would be hidden and an unsuspecting user would see "", which would appear to be a JPEG image, usually unable to harm the machine. However, the operating system would still see the "" extension and run the program, which would then be able to cause harm to the computer. The same is true with files with only one extension: as it is not shown to the user, no information about the file can be deduced without explicitly investigating the file. To further trick users, it is possible to store an icon inside the program, in which case some operating systems' icon assignment for the executable file () would be overridden with an icon commonly used to represent JPEG images, making the program look like an image. Extensions can also be spoofed: some Microsoft Word macro viruses create a Word file in template format and save it with a extension. Since Word generally ignores extensions and looks at the format of the file, these would open as templates, execute, and spread the virus. This represents a practical problem for Windows systems where extension-hiding is turned on by default.
Often intentionally placed information is located at the beginning of a file since this is relatively easy to read from a file both by users and applications. When the information at the beginning of the file is a structure that contains other metadata, then the structure is often called a file header. When the file starts with a relatively small datum that only indicates the format, then it is often called a magic number.
As well as indicating the file format, file headers may contain metadata about the file and its contents. For example, most image files store information about image format, size, resolution and color space, and optionally information such as who made the image, when and where it was made, what camera model and photographic settings were used (Exif), and so on. Such metadata may be used by software reading or interpreting the file during the loading process and afterwards.
File headers may be used by an operating system to quickly gather information about a file without loading it all into memory, but doing so uses more of a computer's resources than reading directly from the directory information. For instance, when a graphic file manager has to display the contents of a folder, it must read the headers of many files before it can display the appropriate icons, but these will be located in different places on the storage medium thus taking longer to access. A folder containing many files with complex metadata such as thumbnail information may require considerable time before it can be displayed.
If a header is Hard coding such that the header itself needs complex interpretation in order to be recognized, especially for metadata content protection's sake, there is a risk that the file format can be misinterpreted. It may even have been badly written at the source. This can result in corrupt metadata which, in extremely bad cases, might even render the file unreadable.
A more complex example of file headers are those used for wrapper (or container) file formats.
The magic number approach offers better guarantees that the format will be identified correctly, and can often determine more precise information about the file. Since reasonably reliable "magic number" tests can be fairly complex, and each file must effectively be tested against every possibility in the magic database, this approach is relatively inefficient, especially for displaying large lists of files (in contrast, file name and metadata-based methods need to check only one piece of data, and match it against a sorted index). Also, data must be read from the file itself, increasing latency as opposed to metadata stored in the directory. Where file types do not lend themselves to recognition in this way, the system must fall back to metadata. It is, however, the best way for a program to check if the file it has been told to process is of the correct format: while the file's name or metadata may be altered independently of its content, failing a well-designed magic number test is a pretty sure sign that the file is either corrupt or of the wrong type. On the other hand, a valid magic number does not guarantee that the file is not corrupt or is of a correct type.
So-called shebang lines in script files are a special case of magic numbers. There, the magic number consists of human-readable text within the file that identifies a specific interpreter and options to be passed to it.
Another operating system using magic numbers is AmigaOS, where magic numbers were called "Magic Cookies" and were adopted as a standard system to recognize executables in Amiga Hunk executable file format and also to let single programs, tools and utilities deal automatically with their saved data files, or any other kind of file types when saving and loading data. This system was then enhanced with the Amiga standard Datatype recognition system. Another method was the FourCC method, originating in OSType on Macintosh, later adapted by Interchange File Format (IFF) and derivatives.
This approach keeps the metadata separate from both the main data and the name, but is also less porting than either filename extensions or "magic numbers", since the format has to be converted from filesystem to filesystem. While this is also true to an extent with filename extensions— for instance, for compatibility with MS-DOS's three character limit— most forms of storage have a roughly equivalent definition of a file's data and name, but may have varying or no representation of further metadata.
Note that zip files and other solve the problem of handling metadata. A utility program collects multiple files together along with metadata about each file and the folders/directories they came from all within one new file (e.g. a zip file with extension ). The new file is also compressed and possibly encrypted, but now is transmissible as a single file across operating systems by FTP transmissions or sent by email as an attachment. At the destination, the single file received has to be unzipped by a compatible utility to be useful. The problems of handling metadata are solved this way using zip files or archive files.
RISC OS uses a similar system, consisting of a 12-bit number which can be looked up in a table of descriptions—e.g. the hexadecimal number is "aliased" to , representing a PostScript file.
The UTI is a Core Foundation string, which uses a reverse-DNS string. Some common and standard types use a domain called (e.g. for a Portable Network Graphics image), while other domains can be used for third-party types (e.g. for Portable Document Format). UTIs can be defined within a hierarchical structure, known as a conformance hierarchy. Thus, conforms to a supertype of , which itself conforms to a supertype of . A UTI can exist in multiple hierarchies, which provides great flexibility.
In addition to file formats, UTIs can also be used for other entities which can exist in macOS, including:
The NTFS filesystem also allows storage of OS/2 extended attributes, as one of the file forks, but this feature is merely present to support the OS/2 subsystem (not present in XP), so the Win32 subsystem treats this information as an opaque block of data and does not use it. Instead, it relies on other file forks to store meta-information in Win32-specific formats. OS/2 extended attributes can still be read and written by Win32 programs, but the data must be entirely parsed by applications.
There are problems with the MIME types though; several organizations and people have created their own MIME types without registering them properly with IANA, which makes the use of this standard awkward in some cases.
This has several drawbacks. Unless the memory images also have reserved spaces for future extensions, extending and improving this type of structured file is very difficult. It also creates files that might be specific to one platform or programming language (for example a structure containing a Pascal string is not recognized as such in C). On the other hand, developing tools for reading and writing these types of files is very simple.
The limitations of the unstructured formats led to the development of other types of file formats that could be easily extended and be backward compatible at the same time.
Throughout the 1970s, many programs used formats of this general kind. For example, word-processors such as troff, Script, and Scribe, and database export files such as CSV. Electronic Arts and Commodore-Amiga also used this type of file format in 1985, with their IFF (Interchange File Format) file format.
A container is sometimes called a "chunk", although "chunk" may also imply that each piece is small, and/or that chunks do not contain other chunks; many formats do not impose those requirements.
The information that identifies a particular "chunk" may be called many different things, often terms including "field name", "identifier", "label", or "tag". The identifiers are often human-readable, and classify parts of the data: for example, as a "surname", "address", "rectangle", "font name", etc. These are not the same thing as identifiers in the sense of a database key or serial number (although an identifier may well identify its as such a key).
With this type of file structure, tools that do not know certain chunk identifiers simply skip those that they do not understand. Depending on the actual meaning of the skipped data, this may or may not be useful (CSS explicitly defines such behavior).
This concept has been used again and again by RIFF (Microsoft-IBM equivalent of IFF), PNG, JPEG storage, DER (Distinguished Encoding Rules) encoded streams and files (which were originally described in CCITT X.409:1984 and therefore predate IFF), and SDXF.
Indeed, any data format must identify the significance of its component parts, and embedded boundary-markers are an obvious way to do so:
Some file formats like ODT and DOCX, being PKZIP-based, are both chunked and carry a directory.
The structure of a directory-based file format lends itself to modifications more easily than unstructured or chunk-based formats. The nature of this type of format allows users to carefully construct files that causes reader software to do things the authors of the format never intended to happen. An example of this is the zip bomb. Directory-based file formats also use values that point at other areas in the file but if some later data value points back at data that was read earlier, it can result in an infinite loop for any reader software that assumes the input file is valid and blindly follows the loop.
|
|